Abstract

The NAS Grid Benchmarks (NGB) are a collection of
synthetic distributed applications designed to rate the
performance and functionality of computational grids. We
compare several implementations of the NGB to determine
programmability and efficiency of NASA's Information
Power Grid (IPG), whose services are mostly based on
the Globus Toolkit. We report on the overheads involved in
porting existing NGB reference implementations to the IPG. 
No changes were made to the component tasks of the NGB. 


1 Introduction 

As computational grids are gaining more acceptance and 
prominence, tools are required to determine which components
of grids function well and which require improvement.
To do this in a systematic way, a standard rating
mechanism must be developed, i.e. grid benchmarks. Our 
approach is to develop such benchmarks primarily to serve 
grid users, since increased application programmer produc- 
tivity and application performance are the main goals of 
computational grids. Consequently, we have focused on 
characterizing actual distributed applications that are suit- 
able for execution on grids. The outcome of that work, the 
first publicly available grid benchmark suite, was released 
under the name NAS Grid Benchmarks (NGB), whose pre- 
cise specification is described in [7]. The motivation, back- 
ground and early experiences with NGB are reported in [3], 
along with a brief description of a reference implementation 
in Java. 

In this paper we discuss two implementations of NGB on
NASA’s production computational grid, called the Information
Power Grid (IPG), and compare them to the earlier serial
implementation. The performance results are reason for
guarded optimism that IPG may be beneficial for NASA’s
application workload, but programmability and reliability 
can still be improved. 

The rest of this paper is structured as follows. In Section 
2 we briefly review the NGB, including the non-IPG ref- 
erence implementations. We also discuss some other grid 
benchmarking and monitoring projects. In Section 3 we de- 
scribe the software and hardware infrastructure of IPG rel- 
evant to our experiments. In Section 4 we give important 
details of our actual IPG NGB implementations. We discuss
performance and programmability results in Section 5.

Some concluding remarks about the current state of the 
IPG, as well as recommendations for enhancement, are pre- 
sented in Section 6. 

2 Background

[...] can be cast in the form of data flow graphs (DFGs). Such
applications are among the simplest in terms of software
infrastructure required, and hence they should be represented
in a basic grid benchmarking suite. The DFGs for these
benchmarks, named Embarrassingly Distributed (ED),
Helical Chain (HC), Visualization Pipe (VP), and Mixed Bag
(MB), are depicted in Figures 1 and 2. The nodes of the
graphs, indicated by the rectangular boxes, are computational
tasks. Dashed arrows indicate control flow between
the tasks, and solid arrows indicate data as well as control
flow. Communication volumes between tasks are indicated
in Figure 2. Launch and Report do little work; they initiate
execution of the task graph and collect performance and
verification results, respectively. The other computational
tasks, SP, BT, LU, MG, and FT, have some internal structure;
see Figure 3 in Section 4. They are derived from the
NAS Parallel Benchmarks (NPB) and involve computations
on sizable multi-dimensional arrays. When the arrays on
connected nodes are of the same size, no data transformation
needs to take place, but if a node has more than one
input arc and/or if the arrays on connected nodes are of
different sizes, an additional computation needs to occur to
merge and/or interpolate data. This computation is carried
out by a function called Mesh Filter [7]. The NGB are
parameterized and can be run for different array sizes, which
are usually referred to as Classes. In this study we use the
two smallest Classes, named S and W.

Figure 1. Data flow graph representing ED
NAS Grid Benchmark.

Figure 2. Data flow graphs representing HC, VP, and MB NAS Grid Benchmarks; numbers in italics and
bold indicate thousands of words communicated between tasks for Classes S and W, respectively.
Tasks are numbered according to the labels in the semi-circles.

ED, HC, VP, and MB highlight different aspects of computational
grids. ED requires no communication,
and all the tasks within the graph can be executed independently.
It tests basic functionality of grids and does not tax
their communication performance; but it does allow us to 
measure the cost of remote task creation. HC is totally se- 
quential at the graph level. Hence, any time spent on com- 
municating data between graph nodes is fully exposed and 
will show up in performance results. VP requires specify- 
ing concurrent execution of tasks in a DFG with nontrivial 
dependencies. It allows pipelining and overlapping commu- 
nication times with computational work. MB is similar to 
VP, but the nodes of the graph all have different amounts of
computational work. 
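
The contrast between ED and HC at the graph level can be made concrete with a small sketch. The code below is illustrative only and is not part of the NGB distribution: the helper names (`fan_dfg`, `chain_dfg`, `levels`, `max_concurrency`) are hypothetical, task names such as `T1` are placeholders, and the task count is a free parameter. It builds an ED-like fan graph and an HC-like chain, then reports the widest topological level as a simple measure of available concurrency.

```python
from collections import defaultdict


def fan_dfg(n_tasks):
    """ED-like graph: Launch -> every task -> Report (hypothetical helper)."""
    tasks = [f"T{i}" for i in range(1, n_tasks + 1)]
    graph = {"Launch": list(tasks), "Report": []}
    for task in tasks:
        graph[task] = ["Report"]
    return graph


def chain_dfg(n_tasks):
    """HC-like graph: Launch -> T1 -> ... -> Tn -> Report (hypothetical helper)."""
    names = ["Launch"] + [f"T{i}" for i in range(1, n_tasks + 1)] + ["Report"]
    graph = {a: [b] for a, b in zip(names, names[1:])}
    graph["Report"] = []
    return graph


def levels(graph):
    """Group nodes by their longest distance from a source node."""
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for succ in successors:
            indegree[succ] += 1
    depth = {node: 0 for node in graph}
    ready = [node for node, deg in indegree.items() if deg == 0]
    while ready:
        node = ready.pop()
        for succ in graph[node]:
            depth[succ] = max(depth[succ], depth[node] + 1)
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    by_level = defaultdict(list)
    for node, d in depth.items():
        by_level[d].append(node)
    return [sorted(by_level[d]) for d in sorted(by_level)]


def max_concurrency(graph):
    """Width of the widest level: tasks that could run at the same time."""
    return max(len(level) for level in levels(graph))
```

With nine tasks, `max_concurrency(fan_dfg(9))` is 9 while `max_concurrency(chain_dfg(9))` is 1, which is why communication time is fully exposed in HC but can be hidden in the wider graphs.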

Other projects in the area of grid benchmarking and eval- 
uation have thus far mainly concentrated on monitoring 
health and status of particular grid components. Examples 
are the Globus Heart Beat Monitor [10] (part of the Globus 
Toolkit [11]), the Network Weather Service [8], and Wren 
[9]. Recently, some efforts were made to determine which 
realistic applications and application classes qualify as po- 
tential grid benchmark examples [6, 12], but no final selec- 
tions have yet been made. The GRASP project [2] provides 
a collection of low level probes that mostly measure the net- 
work capabilities of grids. It distinguishes itself from most 
other grid measurement projects in the sense that the probes 
are formulated and implemented as complete grid applica- 
tions, including authentication, validation, and resource ver- 
ification. 


3 Information Power Grid 


The Information Power Grid [5] was conceived to unify
NASA’s many and geographically dispersed computational
[...] its layered architecture, for work flow creation and
management, scheduling, etc. At present much of the
middleware is provided by the Globus Toolkit, version 2.4.2
[11], and this is the software that we use for implementing
NGB on the IPG. In our experiments we also investigate
the file transfer properties of two protocols that are not part
of GT2.4, but that can be used with it: GridFTP, based on
gsincftp tools (version 3.0.6), and gsiscp version 3.4.
GridFTP, gsiscp, and Globus all use the same authentication
method (Grid Security Infrastructure, GSI).

Systems that are currently part of the IPG include four
SGI Origin 2000 (O2K) and two Origin 3000 (O3K) systems
at NASA Ames in California, a PC Cluster and an
O2K system at NASA Glenn in Ohio, and two O2K and
one O3K system at NASA Langley in Virginia [13]. For
this study we use just the Origin systems at Ames. 

4 Implementations 

We describe two Globus implementations of NGB. The 
first, called NGBIPG-Ksh, uses Korn shell scripts to launch
the tasks in the DFGs and to control data transfers. It is a 
direct extension of the serial NGB reference implementa- 
tion using a single local file system, available from NAS as 
part of the GridNPB3.0 package [14]. We will refer to this 
serial implementation as SHF-Ksh. The second Globus im- 
plementation, called NGBIPG-Jav, uses Java code to com- 
municate directly with GRAM [11] and GridFTP clients. 
Both Globus implementations use Fortran executables to do 
the computational work. At present these executables are 
serial themselves, but several can run simultaneously, de- 
pending on the parallelism present in the NGB DFGs. 

Most of the Origin systems at NAS employ schedulers 
that act as the default Globus jobmanagers. To prevent jobs 
from getting stalled in queues we specify the fork jobman- 
ager for all tasks. This is possible because all submitted 
jobs are of short duration and use minimal machine re- 
sources. Hence, they can run in machine partitions usu- 
ally dedicated to interactive debugging and administrative 
operations. Once the larger NGB Classes are run, how- 
ever, individual NGB tasks will require significantly more 
machine resources, and hence must be queued. Since HC, 
VP, and MB feature dependencies between tasks, independent
scheduling of tasks would be deleterious to IPG performance,
and an approach must be taken that submits an
entire NGB task graph, including all dependencies, to a
grid scheduler, for example as is done in the Grid Navigation
System [4] or in GRID Superscalar [1].

4.1 NGBIPG Korn shell implementation

To express the concurrency of ED, VP, and MB we adopt
an asynchronous model in which the nodes of the DFGs
are issued independently at the NGB launch, using the
globusrun command with the batch flag "-b". The nodes
poll their local file systems for the presence and completeness
of the required input files (if any) through the use of
semaphores (which are zero byte files themselves). After all
nodes are started, the launch script periodically queries their
completion using globus-job-status. To avoid using
too many machine resources, the launch and node scripts
poll only once a second.

The structure of the DFG nodes is indicated in Figure 3.
Although HC is serial in nature, it can also be implemented
using the asynchronous model; the nodes must run in sequence,
but can be started asynchronously. Hence, we provide
two versions of HC, one synchronous, the other asynchronous.
In the synchronous version, which is the most
natural way of transforming the SHF-Ksh version for execution
on the grid, the nodes are launched in blocking mode
(i.e. using globusrun in non-batch mode). Only when the
remote execution request returns is the next node launched.
The following sequence describes the actions of the launch
process. Run node n on host n; transfer output file from
host n to host n + 1; run node n + 1 on host n + 1. Note that
this requires third-party copying if host n and host n + 1
are different from each other and from the host running the
launch process. Other than synchronous HC, all NGBIPG-Ksh
benchmarks allow the choice between GridFTP and
gsiscp to transfer files.

Figure 3. Flow chart of single node of Data
Flow Graph.

4.2 NGBIPG Java implementation

The NGBIPG-Jav implementation uses the GramJob and
GridFTPClient classes of the cog-1.1 package to access
the GRAM and GridFTP servers. The actual access to
the SHF-Ksh executables was accomplished by automatically
creating shell scripts and copying them to the systems
where the executables were invoked. The transfer of files
was accomplished using the transfer method of the
GridFTPClient class and execution of the scripts was
accomplished using the request method of the GramJob
class. Time required for creation of the shell scripts and 
their transfers to the grid machines was included in the 
turnaround time of the benchmarks. 

For each NGB task we have to copy the following files:
the shell script which executes the mesh filter and the actual 
NPB task, the final report of the task, each of the resulting 
data files, and the semaphore indicating that the result is 
ready. We also have to execute the chmod command on the 
remote system to allow execution of the script. Overall, it 
translates into 2 * (1 + task output degree) accesses to the 
GridFTP clients and two GramJob requests for each graph 
node, where task output degree equals the number of output 
files. We estimate that our code adds 1-2% overhead to the 
actual benchmark turnaround time in the form of time spent 
to write scripts and to poll internal job submission threads. 
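
The service-call count above can be written out as a short calculation. This is a minimal sketch of the arithmetic only; the function names are hypothetical and the output degrees in the usage note are example values, not measurements from the paper.

```python
def ipg_service_calls(task_output_degree):
    """Return (GridFTP accesses, GramJob requests) for one graph node.

    Follows the count stated in the text: 2 * (1 + task output degree)
    GridFTP accesses and two GramJob requests per node.
    """
    gridftp_accesses = 2 * (1 + task_output_degree)
    gramjob_requests = 2
    return gridftp_accesses, gramjob_requests


def total_service_calls(output_degrees):
    """Sum the per-node counts over a list of task output degrees."""
    per_node = [ipg_service_calls(degree) for degree in output_degrees]
    return (sum(g for g, _ in per_node), sum(j for _, j in per_node))
```

For example, a node with a single output file needs four GridFTP accesses and two GramJob requests, and a hypothetical chain of nine such nodes needs 36 and 18, respectively.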

The submission of the NGB tasks was performed in a 
data flow manner. In other words, a task was submitted 
for execution on a statically specified host as soon as all 
its predecessors had finished their work and transferred the 
necessary files to the target machine. For ED it translates 
into simultaneous submission of all tasks; for HC into syn- 
chronous submission described in the previous subsection; 
and for VP and MB into submission of all tasks that were 
dynamically determined to be ready. 
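
The data flow submission rule can be sketched as follows. This is a simplified illustration, not the NGBIPG-Jav code: it makes no GRAM or GridFTP calls, the name `dataflow_submit` and the `run_task` callback are hypothetical, and it assumes that all tasks submitted in one wave finish before the next readiness check, whereas the real implementation reacts to each completion individually.

```python
def dataflow_submit(predecessors, run_task):
    """Submit tasks in waves; a task is ready once all predecessors finished.

    predecessors maps each task name to the names of the tasks it depends on.
    run_task is called once per task, standing in for a remote job request.
    Returns the list of waves (each a sorted list of task names).
    """
    remaining = {task: set(deps) for task, deps in predecessors.items()}
    finished = set()
    waves = []
    while remaining:
        ready = sorted(t for t, deps in remaining.items() if deps <= finished)
        if not ready:
            raise ValueError("cycle or unknown predecessor in task graph")
        for task in ready:
            run_task(task)
            del remaining[task]
        finished.update(ready)
        waves.append(ready)
    return waves
```

An ED-like graph (no predecessors) yields a single wave containing every task, an HC-like chain yields one task per wave, and graphs such as VP and MB fall in between.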

4.3 Common implementation problems 

We identified the following issues in the development of 
the IPG versions of NGB, independent of the API chosen. 

Preliminaries: 

• Gridmap files containing IPG user information need 
to be updated on all grid machines by various system 
administrators before a proxy can be used there. The 
procedure is not transparent to the user. Messages from 
various administrators are confusing. 

• Globus jobs can only be started from systems on which 
the user has obtained a proxy. This makes access to 
the grid asymmetrical, and the user has to remember 
where the proxy was obtained. 

• It is possible to obtain a proxy on a host by copying the 
gridmap file from a machine on which an earlier proxy 
was obtained. This constitutes a security breach. 

File transfer: 

• GridFTP 

1. Does not allow third-party copying of files. 

2. Does not allow the renaming of the target file 
in the gsincftp command line interface (but
does allow it in the Java API). 

3. Does not retain the access permissions on files at 
the destination. For executables this necessitates 
issuing a "chmod" command after the file transfer
completes, with the concomitant overhead of
a remote execution. 


4. Will clobber a file that is copied onto itself, while 
no mechanism to avoid clobbering is provided. 

Elimination of the first and second GridFTP restric- 
tions is highly desirable for flexible programming. 
Third-party copying can always be accomplished 
through two two-party copies, but this is costly and 
usually leads to sequential bottlenecks. Target file re- 
naming is useful, for example, in the semaphore tech- 
nique used in NGBIPG-Ksh to signal successor nodes 
in the DFG that inputs have arrived. If a semaphore of 
the same name must exist on the sending host, it will 
not be possible to determine whether sending and re- 
ceiving host share the same file system (as is the case 
for a number of hosts at NAS). Hence, a file transfer 
that is intended to be a move (destroy the copy at the 
source if sender and receiver use different file systems) 
cannot be safely implemented. Item 4 is especially 
harmful, as for the user it is often difficult or costly 
to determine whether two files on different hosts are 
actually the same. Safety measures to avoid accidental 
destruction of files are cumbersome to implement. 

• globus_URL_copy 

1. Suffers from most of the ills that GridFTP does. 

2. Requires knowledge of absolute paths of files on 
the various hosts on which grid users have ac- 
counts, since there is no concept of user home 
directories. This severely inhibits transparency 
of grid use and makes applications non-portable. 

• gsiscp 

1. This file transfer protocol has all the properties 
desirable for convenient programming (target file 
renaming, no clobbering, persistence of permis- 
sions, third-party copying), but is reported to be 
slower than GridFTP and globus_URL_copy
(however, see Section 5 for a comparison). 
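
The clobbering hazard and the "move" semantics discussed for the file transfer tools can be illustrated on a single machine. The sketch below is a local analogue only: it uses Python's standard library rather than GridFTP or gsiscp, the name `safe_move` is hypothetical, and it assumes both paths are visible from the calling host, which is exactly the knowledge a grid user often lacks.

```python
import os
import shutil


def safe_move(src, dst):
    """Move src to dst unless both names already refer to the same file.

    Returns "shared" when source and destination are one file (for example
    through a cross-mounted file system), so nothing is copied or deleted.
    Returns "moved" after copying to dst and removing the source copy.
    """
    if os.path.exists(dst) and os.path.samefile(src, dst):
        return "shared"
    shutil.copy2(src, dst)  # copy2 also carries over permission bits
    os.remove(src)
    return "moved"
```

Without the same-file check, deleting the source after a copy onto itself would destroy the only copy, which is why a transfer tool that cannot detect or avoid this case makes a safe move hard to implement.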

Execution environment: 

• The exit status of a remotely executing command is not
available to the globusrun command that launched
it. It has to be stored in a file and transferred to the
place from which the globusrun command was issued.

• If the fork jobmanager is used to launch a remote
script, the user’s shell resource files are not
sourced, which means that no path will be available,
except that to globus commands through the
$GLOBUS_LOCATION environment variable. If the
batch scheduler jobmanager (Portable Batch System
at NAS) is selected, resource files will get sourced.
This means that jobs may succeed or fail, depending
on what jobmanager is selected. A solution is always 
to source resource files in scripts executing through 
globusrun. 
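
The exit status workaround described in the first item can be sketched as a small wrapper. This is an illustration under stated assumptions, not the scripts used in the experiments: the function names and the status file convention are hypothetical, no Globus command is invoked, and the wrapper is assumed to run on the remote host with the status file shipped back afterwards by whatever transfer tool is in use.

```python
import subprocess


def run_and_record_status(command, status_path):
    """Run command (a list of arguments) and store its exit status in a file."""
    completed = subprocess.run(command)
    with open(status_path, "w") as handle:
        handle.write(f"{completed.returncode}\n")
    return completed.returncode


def read_status(status_path):
    """Read back the exit status on the side that launched the job."""
    with open(status_path) as handle:
        return int(handle.read().strip())
```

The launching side then calls `read_status` on the transferred file instead of relying on the return value of the job submission command.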

5 Results 

We present performance results for the following exper- 
iments and configurations. To ensure repeatability no grid 
schedulers were used; we explicitly prescribe the machines 
on which the tasks are to be run. The machines used in our 
experiments are listed in Table 1. 

Configurations boyd and dean consist of just those ma- 
chines, respectively. Configuration duo consists of dean and 
boyd. Configuration mix consists of dean, boyd, alan, dean, 
joe, grace, dean, harv, and boyd. Tasks in the DFGs are 
mapped round robin to the sequence of machines in these 
configurations. 
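
The round robin mapping can be written down directly. The sketch below uses the machine sequences listed above; the name `map_round_robin` and the task labels `T1` to `T9` are hypothetical placeholders.

```python
from itertools import cycle

# Machine sequences for the four configurations, as listed in the text.
CONFIGURATIONS = {
    "boyd": ["boyd"],
    "dean": ["dean"],
    "duo": ["dean", "boyd"],
    "mix": ["dean", "boyd", "alan", "dean", "joe",
            "grace", "dean", "harv", "boyd"],
}


def map_round_robin(tasks, hosts):
    """Assign each task to the next host in the sequence, wrapping around."""
    return dict(zip(tasks, cycle(hosts)))
```

For nine tasks on configuration duo this alternates dean, boyd, dean, and so on, so dean receives five tasks and boyd four.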

The turnaround time does not include deployment (copy- 
ing) of executables; it does include deployment of scripts 
and the verification. 

5.1 Korn Shell 

In the experiments reported in Table 2 we always used 
gsiscp to copy semaphore files between hosts (GridFTP 
was not satisfactory, since its command line interface does 
not allow file renaming). An alternative to copying is to 
create semaphore files directly on remote systems by is- 
suing touch commands through globusrun. However, 
in experiments with HC on configuration duo, the touch
semaphore approach generally showed significantly poorer
performance than gsiscp, as well as much greater variability.
For instance, 28 interleaved runs of HC Class S
with both approaches showed average turnaround times that 
were 3.9 seconds longer for touch than for gsiscp. And 
whereas variability with gsiscp was within 17%, touch 
produced differences of up to 57%. 

In our first NGBIPG-Ksh implementation, which did not
yet include MB, we did not provide methods to determine
which hosts’ file systems were cross-mounted and which
were not. Consequently, some redundant copying (overwriting)
of files occurred in the gsiscp HC and VP experiments
on configuration mix. We did not obtain performance
results for the GridFTP versions of VP and HC
on configuration mix, since file clobbering occurred on the
cross-mounted file systems.¹

¹ By the time we had implemented MB and supplied it with a mechanism to
detect file systems shared by hosts, the IPG had changed and all configurations
shared the same file system. Thus, in the duo and mix configurations
of MB no redundant copying of files ever occurred, and accidental clob- 
bering of files could be avoided. We did not rerun all other benchmarks 
using the same technique, because then we would not have been able to 
test the file transfer properties of gsiscp and GridFTP. 


As expected, in the single-system configuration experiments,
dean performed significantly better than boyd, since
dean has much faster processors. Moreover, ED, which 
has the highest level of concurrency, quickly saturates the 
CPUs available to the fork jobmanager on boyd, whereas 
dean, which has many more processors, can accommodate 
all nine tasks simultaneously without performance deteri- 
oration. The reason why ED Class S does not benefit as 
much from the faster processors as does Class W is that its 
performance is dominated by remote process creation and 
polling for results. This is demonstrated by comparing the 
IPG and non-IPG (loc) performance results of HC Class 
S; virtually all execution time is spent on process manage- 
ment in the IPG implementation. The effect of saturation is 
highlighted by the ED Class W experiments on configuration
duo. Even though performance of ED is determined, in
principle, by the slowest system in the configuration (duo
utilizes both boyd and dean), turnaround time on duo is less
than half that on boyd. 

While HC is synchronous in nature, the asynchronous 
implementation always runs fastest. The reason is that re- 
mote process creation can be overlapped with computation, 
which the synchronous version does not allow. 

The combined results of the HC and VP runs show that 
gsiscp is significantly faster than GridFTP on our testbed. 
When files are shared by tasks on the same host, we never 
copy, which explains why the gsiscp and GridFTP results 
on configurations boyd and dean are not affected by the file 
transfer protocol. However, on configuration duo, where
two hosts with different file systems are used, VP with
gsiscp is 19 seconds faster than with GridFTP for both
benchmark classes. For HC, which cannot hide transmission
costs, the difference is even larger. Deployment of executables
and scripts is not reported in Table 2, since it takes
place before the actual job launch. However, we did mea- 
sure these file copy speeds, and for HC and VP, GridFTP 
was between four and six times slower than gsiscp. For 
ED the difference was a factor of three to four. 

5.2 Java 

Turnaround times of the NGBIPG-Jav implementation
are presented in Table 3. As was mentioned in Section 4, we
perform 2 * (1 + task output degree) accesses to the GridFTP
clients and make two GramJob requests within each graph
node. (Because of the shared file system on some machines,
not all GridFTP calls actually result in a file copy.) For
Class W of HC and VP we estimate that about 2/3 of the 
turnaround time is spent accessing and delivering these IPG 
services, and 1/3 of the time for the actual processing. Ac- 
cess and use of the services mostly involves network and 
memory subsystems, which blunts the raw processor speed 
advantage of dean. Otherwise, the same arguments used to 



Table 1. Grid Machines. The file systems of boyd, alan, joe, grace, and harv are cross-mounted, while
dean has its own file system.

Name    Clock Rate (MHz)   Architecture   Number of Processors   Processor Type
boyd    250                O2K            16                     MIPS R10000
alan    250                O2K            32                     MIPS R10000
grace   250                O2K            128                    MIPS R10000
joe     400                O2K            128                    MIPS R12000
harv    400                O3K            512                    MIPS R12000
dean    600                O3K            1024                   MIPS R14000


Table 2. Turnaround times (sec) for IPG (NGBIPG) and non-IPG (SHF) implementations using Korn
shell scripts. Here conf indicates configuration, loc refers to running all tasks locally from a single
script (SHF-Ksh). HC gsiscp-sync refers to the explicitly synchronous version. MB: see footnote 1.

              ED                HC                                    VP                MB
Class  conf   GridFTP  gsiscp   GridFTP  gsiscp  gsiscp-sync  loc     GridFTP  gsiscp   GridFTP  gsiscp
S      boyd   38       40       33       ...     55           ...     34       34       87       86
S      dean   32       30       27       27      47           1       28       27       31       31
S      duo    36       34       51       29      78           -       50       31       33       34
S      mix    39       37       -        ...     115          -       -        51       37       36
W      boyd   656      650      159      159     214          128     84       83       87       86
W      dean   71       69       59       58      129          50      39       39       41       42
W      duo    273      273      142      112     203          -       81       62       65       65
W      mix    171      172      -        157     283          -       -        88       56       56

explain the NGBIPG-Ksh turnaround times are valid. 


Table 3. Turnaround times (sec) for IPG
(NGBIPG) implementations using Java. Here
conf indicates configuration.

Class  conf   ED    HC    VP    MB
S      boyd   104   251   205   165
S      dean   83    242   176   117
S      duo    101   255   192   139
S      mix    93    271   200   183
W      boyd   660   320   243   228
W      dean   132   247   179   165
W      duo    404   296   231   203
W      mix    230   346   214   204


We have to be careful when interpreting the performance 
results, because a number of factors were beyond our con- 
trol and not all could be measured independently. These 
include actual machine load, network traffic stability,
interference with other IPG users, and setup of the IPG services
(timeout, polling intervals, etc.). To reduce the influence
of these random factors we repeat our experiments multiple
times and report the best results (shortest turnaround
times). Variation of turnaround times among runs reported
in Tables 2 and 3 was within 20%. Also, a number of our
experiments failed because of IPG authentication failure. In
these cases we just repeat the experiment with a 10-minute
delay. 
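
The reporting rule (best of several runs, with a bound on run-to-run variation) can be sketched as below. The paper does not define how variation was computed, so the spread measure here, the gap between slowest and fastest run relative to the fastest, is an assumption, as are the function names and the sample numbers in the usage note.

```python
def best_turnaround(times):
    """Return the reported value: the shortest turnaround time of all runs."""
    return min(times)


def relative_spread(times):
    """Gap between slowest and fastest run, relative to the fastest run.

    This particular definition is an assumption made for illustration.
    """
    fastest = min(times)
    return (max(times) - fastest) / fastest
```

For three hypothetical runs of 50, 55, and 60 seconds, the reported time would be 50 seconds and the spread 0.2, i.e. 20%.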

Volatility of the performance of the grid machines is the 
major factor which can distort the performance of the IPG 
benchmarks. To gauge performance volatility we used the 
non-IPG Java version of the NGB [14] and monitored grid
machines during some of our experiments. The monitoring 
results using ED.S, depicted in Figure 4, show that perfor- 
mance of the grid machines usually is within 20% of the av- 
erage when all machines are up and the benchmark servers 
are properly installed. 

6 Conclusions 

It is important to recognize that our performance experiments
are not exhaustive, and are focused on determining
overheads of currently implemented IPG services, which
are likely to improve over time. The experiments, however,
demonstrate the use of NGB to measure grid characteristics
and as a pathfinding tool for grid application developers.

Figure 4. Typical turnaround time of Java
ED.S benchmark running on nine grid machines.
Grey dots indicate benchmark did not
pass verification because of machine unavailability.
Large peak indicates a setup process
of a fresh server on one of the grid machines.

One important aspect, programmability of the IPG, was
found to be in need of some improvement. Unexpected
file clobbering by GridFTP, absence of third-party copying,
and lack of functionality of its command line interface
(no file renaming) lead to cumbersome programming.
The exit status of jobs executed through globusrun cannot
be communicated to the calling process, other than
through explicit capture and file transfer by the programmer.
globus_URL_copy requires explicit knowledge of remote
absolute paths, which hampers portability. Execution of
Unix commands through globusrun requires knowledge
of the paths to these commands on remote machines, which
also hampers portability.